In [20]:
import sys
import sumr
reload(sumr)
base_path = 'quarter_example/'
import os
import itertools
from bs4 import BeautifulSoup

AIG's First Quarter 2009 Earnings Announcement

As an example, we're going to create a single document summarization for AIG's 2009 first quarter earnings announcement. The full text of the announcement can be found here.

As you can see, most of the pertinent information is concentrated in the first 10 pages of the document. Digging deeper, the main highlights can be found in the first paragraph:

American International Group, Inc. (AIG) today reported a net loss for the first quarter of 2009 of \$4.35 billion or \$1.98 per diluted share, compared to a net loss of \$7.81 billion or \$3.09 per diluted share in the first quarter of 2008. First quarter 2009 adjusted net loss, excluding net realized capital gains (losses) and FAS 133 gains (losses), net of tax, was \$1.60 billion, compared to an adjusted net loss of \$3.56 billion in the first quarter of 2008.

These are the headline numbers that most analysts and investors pay attention to: the paragraph contains the earnings figure and a comparison with the earnings for the same quarter of the previous year. For that reason, each of the algorithms below is designed to return the first sentence of the first paragraph of each earnings announcement. However, I do not show this sentence in the results; a separate function call naively returns it for each document.

The traditional measure of summarization quality is Recall-Oriented Understudy for Gisting Evaluation (ROUGE). This metric looks at the overlap in words (as defined by some n-gram window) between the automatically generated summary and one that was created manually by a human.
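As a rough illustration of the overlap idea, ROUGE-N recall could be sketched as below. This is not part of `sumr`: the names `ngrams` and `rouge_n_recall` are made up for this example, and tokenization is assumed to be a plain lowercase whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate.

    Counts are clipped, so an n-gram is credited at most as many times
    as it occurs in the candidate summary.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / float(total)
```

Here `reference` would be the manually written summary and `candidate` the automatically generated one; a larger `n` rewards matching longer word sequences.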

For simplicity's sake, I haven't generated a ROUGE score because we're only generating 6 key sentences for each of our documents. I start with the ideal summary and then move on to generating single document summaries using each of the three summarization algorithms.

Ideal Summary (Manually created)

American International Group, Inc. (AIG) today reported a net loss for the first quarter of 2009 of \$4.35 billion or \$1.98 per diluted share, compared to a net loss of \$7.81 billion or \$3.09 per diluted share in the first quarter of 2008.

AIG reported a \$1.9 billion pre-tax (\$1.2 billion after tax) charge for restructuring costs, primarily related to the wind down of AIG Financial Products Corp., AIG Trading Group, Inc. and their subsidiaries (collectively, AIGFP) and other.

AIG reported market disruption-related losses of \$2.5 billion pre-tax (\$1.6 billion after tax).

The Federal Reserve Bank of New York (FRBNY) Credit Agreement was amended to remove the minimum 3.5 percent LIBOR floor as of April 17, 2009.

The stabilization of rates is an improvement from the fourth quarter of 2008 and reflects the current market conditions.

The foreign exchange effect for the first quarter of 2009 was a reduction of reserves of \$290 million.

In [25]:
doc_list = os.listdir(base_path)
In [26]:
CleanText = sumr.TextCleaner(doc_list)
CleanText.read_docs(base_path = base_path)
tf_idf = CleanText.make_tf_idf()
In [27]:
NaiveSum = sumr.NaiveSumr(CleanText)

The naive summarization uses the following features:

  1. Summation based selection
  2. Density based selection
  3. Keywords
  4. Sentence position
  5. TF-IDF weights

I weight each sentence as a function of the above features. Compared with the ideal summary, this recovers 3 of the 5 manually generated summary sentences (the headline first sentence is excluded, as noted earlier).
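The exact weighting inside `sumr.NaiveSumr` is not shown here, so the following is only a minimal sketch of the general idea, using two of the listed features (sentence position and term weights). The function names, the `position_weight` parameter, and the way the two features are combined are all assumptions made for illustration.

```python
def score_sentences(sentences, term_weights, position_weight=0.5):
    """Score each sentence by mean term weight plus a position bonus.

    sentences: list of strings in document order.
    term_weights: dict mapping a lowercase term to a weight (e.g. TF-IDF).
    position_weight: hypothetical mixing weight for the position feature.
    """
    n = len(sentences)
    scores = []
    for i, sentence in enumerate(sentences):
        tokens = sentence.lower().split()
        if tokens:
            term_score = sum(term_weights.get(t, 0.0) for t in tokens) / len(tokens)
        else:
            term_score = 0.0
        # Earlier sentences get a larger bonus: 1.0 for the first sentence,
        # shrinking linearly towards 0 for later ones.
        position_score = 1.0 - float(i) / n
        scores.append(term_score + position_weight * position_score)
    return scores


def top_k(sentences, scores, k):
    """Return the k highest-scoring sentences, restored to document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]
```

Returning the selected sentences in document order keeps the summary readable even though the ranking itself ignores order.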

In [33]:
NaiveSum.Summarize(doc_list[0])
Out[33]:
[u'The Federal Reserve Bank of New York (FRBNY) Credit Agreement was amended to remove the minimum 3.5 percent LIBOR floor as of April 17, 2009.',
 u'The stabilization of rates is an improvement from the fourth quarter of 2008 and reflects the current market conditions.',
 u'Foreign General net premiums written in the first quarter of 2009 were $3.6 billion, a 10.3 percent decline in original currency, or 18.1 percent including the effect of foreign exchange.',
 u'The foreign exchange effect for the first quarter of 2009 was a reduction of reserves of $290 million.',
 u'For the first quarter of 2009, net adverse loss development from prior accident years, excluding accretion of loss reserve discount, was $64 million.']

The TextRank summarization uses the algorithm outlined in (Mihalcea & Tarau, 2004).

Specifically, we update the node scores according to the following formula:

$WS(N_{i}) = (1-\alpha) + \alpha * \sum_{N_{j} \in In(N_{i})} \frac{w_{ji}}{\sum_{N_{k} \in Out(N_{j})} w_{jk}}WS(N_{j})$

Here, $N$ denotes a node and $WS(N)$ its score. $In(N)$ is the set of nodes with directed edges into node $N$, $Out(N)$ is the set of nodes with directed edges coming from node $N$, $\alpha$ is a damping parameter that accounts for the "jump" probability between nodes, and $w_{ji}$ is the weight of the edge from node $N_j$ to node $N_i$.

I set $\alpha$ to 0.7 and use Levenshtein distance as the weighting between nodes. Nodes are complete sentences within the document.
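The update rule above can be sketched as a simple fixed-point iteration. This is a minimal illustration rather than the code in `sumr.SumrGraph`: the function name, the dense list-of-lists weight matrix, and the fixed iteration count are assumptions, and turning Levenshtein distances into edge weights is left to the caller.

```python
def textrank_scores(weights, alpha=0.7, n_iter=50):
    """Iterate the weighted TextRank update on a dense weight matrix.

    weights[j][i] is the edge weight w_ji from node j to node i
    (0 means no edge). Returns one score per node.
    """
    n = len(weights)
    # Total outgoing weight of each node j: the sum over k of w_jk.
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if weights[j][i] > 0 and out_sum[j] > 0
            )
            new_scores.append((1.0 - alpha) + alpha * incoming)
        scores = new_scores
    return scores
```

Sentences would then be ranked by their score and the top few returned as the summary.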

The TextRank summary recovers 2 of the 5 manually generated summary sentences.

In [36]:
TextRankSum = sumr.SumrGraph(CleanText)
TextRankSum.Summarize(doc_list[0])
Out[36]:
[u'Premiums and other considerations declined 10.5 percent, to $8.3 billion in the first quarter of 2009.',
 u'AIG reported market disruption-related losses of $2.5 billion pre-tax ($1.6 billion after tax).',
 u'The foreign exchange effect for the first quarter of 2009 was a reduction of reserves of $290 million.',
 u'These items were partially offset by a $1.7 billion favorable credit valuation adjustment.',
 u'Reserves were also reduced by $287 million due to dispositions.']

The LSA summarization uses the algorithm outlined in (Gong & Liu, 2001).

The steps of the algorithm are:

  1. Decompose the document D into individual sentences; call this the candidate set S
  2. Construct the term-by-sentence matrix A for document D, where rows are terms and columns are sentences
  3. Perform the SVD on A to obtain the singular value matrix $\Sigma$ and the right singular vector matrix $V^T$
  4. Select the $k$-th right singular vector from matrix $V^T$ (starting with $k = 1$)
  5. Select the sentence that has the highest loading on the $k$-th right singular vector
  6. Increment $k$ and repeat steps 4 and 5 until the summary reaches the desired length
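
The steps above can be sketched with NumPy as follows. This is an illustration under several assumptions, not the `sumr.LSASumr` implementation: the function name `lsa_select` is made up, the matrix holds raw term counts from a whitespace split, and the loading is compared by absolute value because the sign of a singular vector is arbitrary.

```python
import numpy as np


def lsa_select(sentences, k):
    """Pick up to k sentence indices, one per leading right singular vector."""
    # Steps 1-2: build the term-by-sentence count matrix A.
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted(set(t for tokens in tokenized for t in tokens))
    index = {term: row for row, term in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for col, tokens in enumerate(tokenized):
        for t in tokens:
            A[index[t], col] += 1.0

    # Step 3: SVD; the rows of Vt are the right singular vectors.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)

    # Steps 4-6: for each singular vector in turn, take the sentence with
    # the largest absolute loading that has not been chosen already.
    chosen = []
    for row in Vt[:k]:
        for col in np.argsort(-np.abs(row)):
            if int(col) not in chosen:
                chosen.append(int(col))
                break
    return chosen
```

The function returns column indices into `sentences`, ordered by the singular vector that selected them, so the first index corresponds to the strongest latent topic.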

Note:

Recall that SVD decomposes $A = U \Sigma V^T$, where $\Sigma$ is a diagonal matrix and $U$ and $V^T$ are orthogonal matrices. Therefore, $AA^T = U \Sigma V^TV \Sigma^T U^T = U \Sigma^2U^T$ and $A^TA = V \Sigma^TU^TU \Sigma V^T = V \Sigma^2V^T$. $\Sigma^2$ is a diagonal matrix, therefore it must be the case that $V$ contains the eigenvectors of $A^TA$ and $U$ contains the eigenvectors of $AA^T$.

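Because the rows of $A$ are terms and the columns are sentences, the elements of $A^TA$ are the dot products between the term vectors of each pair of sentences, so we can think of $A^TA$ as a "sentence similarity matrix". Correspondingly, the elements of $AA^T$ are the dot products between pairs of terms across sentences, so $AA^T$ plays the role of the "term covariance matrix". Projecting onto a lower dimensional space finds the "latent" terms or topics that underlie the document.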
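
The eigenvector relationships in this note can be checked numerically. The matrix below is a small made-up example (4 terms, 3 sentences), not data from the AIG document, and the variable names are only illustrative.

```python
import numpy as np

# A small made-up term-by-sentence matrix: 4 terms (rows), 3 sentences (columns).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

# A^T A V = V Sigma^2: the columns of V are eigenvectors of A^T A.
sentence_side_ok = np.allclose(A.T.dot(A).dot(V), V * s**2)
# A A^T U = U Sigma^2: the columns of U are eigenvectors of A A^T.
term_side_ok = np.allclose(A.dot(A.T).dot(U), U * s**2)

print(A.T.dot(A).shape)  # (3, 3): one row and column per sentence
print(A.dot(A.T).shape)  # (4, 4): one row and column per term
```

Both checks should come out `True`, with the squared singular values `s**2` acting as the eigenvalues in each case.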

The LSA summary recovers 1 of the 5 manually generated summary sentences.

In [37]:
lsasumr = sumr.LSASumr(CleanText)
lsasumr.make_term_sentence_matrix(doc_list[0])
lsasumr.get_singular_vector()
lsasumr.Summarize(doc_list[0])
Out[37]:
[u'First quarter 2009 adjusted net loss, excluding net realized capital gains (losses) and FAS 133 gains (losses), net of tax, was $1.60 billion, compared to an adjusted net loss of $3.56 billion in the first quarter of 2008.',
 u'AIG reported market disruption-related losses of $2.5 billion pre-tax ($1.6 billion after tax).',
 u'AIG Private Bank Ltd. (AIG Private Bank) to a subsidiary of Aabar Investments PJSC (Aabar), a global investment company based in Abu Dhabi, for approximately $253 million for the entire share capital of AIG Private Bank.',
 u'Reserves were also reduced by $287 million due to dispositions.',
 u'These decreases were partially offset by higher DAC benefits related to net realized capital losses.']